Greedy Learning of Graphical Models with Small Girth

ABSTRACT 

This paper presents three new greedy algorithms for learning discrete graphical models. The original greedy
algorithm constructs the neighborhood of each node by sequentially adding, at each step, the node (variable) that
produces the maximum decrease in its conditional entropy. Though simple, this does not always yield the
correct graph when there are short cycles, since a non-neighbor may produce the largest decrease in conditional
entropy at some step and is then added to the neighborhood.

The new algorithms overcome this problem in three different ways. The recursive greedy algorithm iteratively
runs the greedy algorithm in an inner loop, but each time includes only the last added node in the neighborhood set.
The forward-backward greedy algorithm, on the other hand, includes a node deletion step in each iteration, which
prunes incorrect nodes that may have been added to the neighborhood set earlier. Finally, the greedy algorithm with
pruning runs the greedy algorithm until completion and then removes all the incorrect neighbors. We give both
analytical guarantees and empirical results for our algorithms. Their greedy approach, combined with running them on
a candidate set of nodes instead of all the nodes, enables them to efficiently learn graphs even with small girth,
which the previous greedy and convex optimization based algorithms cannot learn.



Avik  Ray,  Sujay  Sanghavi  and  Sanjay  Shakkottai 




I.  Introduction 

Graphical models have been widely used to tractably capture
dependence relations amongst a collection of random variables
in a variety of domains, ranging from statistical physics and social
networks to biological applications. A key challenge in
these settings is learning the precise dependence structure
among the random variables, a problem that in the worst case
is known to be NP-hard in the number of variables. However,
with restrictions placed on the class of graphical models
considered, polynomial time algorithms are known to exist.
Exploring the relationship between the classes of graphical models
that can be learnt and the sample and computational complexity
of doing so is an active field of research. One of the first results
in this spirit is that of Chow and Liu [9], who developed efficient
algorithms for learning tree-structured graphical models.
Since then, several algorithms have been
developed for learning restricted classes of graphical models
(see Section I-B for more details).

A.  Main  Contributions 

In this paper we propose three new greedy algorithms to
find the Markov graph for any discrete graphical model. While
greedy algorithms (that learn the structure by sequentially
adding nodes and edges to the graph) tend to have low
computational complexity, they are known to fail (i.e., not
determine the correct graph structure) in loopy graphs with
low girth, even when they have access to exact statistics.
This is because a non-neighbor can be the best node at a
particular iteration; once added, it always remains. Convex
optimization based algorithms, such as the one by Ravikumar et
al. [10] (henceforth the RWL algorithm), also cannot
provide theoretical guarantees of learning in these situations.
These methods require strong incoherence conditions to guarantee
success, but such conditions may not be satisfied even in
simple graphs with small girth [18]. Example: if we run the
existing algorithms for an Ising model on a diamond network
(Figure 2 with D = 4), the performance plot in Figure 1 shows
that the greedy and RWL algorithms fail to learn the correct graph
even with a large number of samples.

A. Ray, S. Sanghavi and S. Shakkottai are with the Department of
Electrical and Computer Engineering, The University of Texas at Austin,
USA. Emails: avik@utexas.edu, sanghavi@mail.utexas.edu,
shakkott@austin.utexas.edu. An earlier version of this work
appeared in the Proceedings of the 50th Annual Allerton Conference on
Communication, Control, and Computing, 2012. We would like to acknowledge
NSF grants 0954059 and 1017525, and ARO grant W911NF-11-1-0265.


[Figure 1 plot. Panel title: Diamond network, max degree = 4, threshold degree = 3.]

Fig. 1: Performance of different algorithms in an Ising model
on a diamond network with 6 nodes (Figure 2 with D = 4).
Both the Greedy($\epsilon$) and RWL algorithms estimate an incorrect
edge between nodes 0 and 5 and therefore never recover the
true graph G, while our new RecGreedy($\epsilon$), FbGreedy($\epsilon$, $\alpha$)
and GreedyP($\epsilon$) algorithms succeed.

In  this  paper,  we  present  three  algorithms  that  overcome  this 
shortfall  of  greedy  and  convex  optimization  based  algorithms. 

• The recursive greedy algorithm is based on the observation
that the last node added by the simple, naive greedy
algorithm is always a neighbor; thus, we can use the
naive greedy algorithm as an inner loop that, after every
execution, yields just one more neighbor (instead of the
entire set).

• The forward-backward greedy algorithm takes a different
tack, interleaving node addition (forward steps) with node
removal (backward steps). In particular, in every iteration,
the algorithm looks for nodes in the existing set that have
a very small marginal effect; these are removed. Note
that these nodes may have had a big effect in a previous
iteration when they were added, but the inclusion of
subsequent nodes shows that they do not have enough of a
direct effect.

• Our third algorithm, the greedy algorithm with
pruning, first runs the greedy algorithm until it is unable
to add any more nodes to the neighborhood estimate.
Subsequently, it executes a node pruning step that identifies
and removes all the incorrect neighbors that the greedy
algorithm may have included in the neighborhood estimate.

• For all these algorithms, we show that a simple node
selection step at the beginning, followed by any one of
the recursive, forward-backward or greedy with pruning
algorithms, can efficiently learn the structure of a large
class of graphical models even when they have small
cycles. We calculate the sample complexity and computational
complexity (number of iterations) for these
algorithms (with high probability) under non-degeneracy
and correlation decay assumptions (Theorems 3 and 4).

• Finally, we present numerical results that indicate
tractable sample and computational complexity for loopy
graphs (diamond graph, grid).


B.  Related  Work 

Several approaches have been taken so far to learn the graph
structure of an MRF in the presence of cycles. These can be broadly
divided into three classes: search based, convex optimization
based, and greedy methods.

Search based algorithms like the local independence test (LIT)
by Bresler et al. and the conditional variation distance
thresholding (CVDT) by Anandkumar et al. try to find, through
exhaustive search, the smallest set of nodes
conditioned on which either a given node is independent of the rest
of the nodes in the graph, or a pair of nodes are independent
of each other. These algorithms have fairly good sample
complexity, but the exhaustive search gives them high
computational complexity. Running them also requires
additional information about the graph
structure: the local independence test requires
knowledge of the maximum degree of the graph, and the CVDT
algorithm requires knowledge of the maximum size of the
local separator for the graph.

For Ising models, a convex optimization based learning
algorithm was proposed by Ravikumar et al. [10]. This
was further generalized to any pairwise graphical model.
These algorithms use the parametric form of the distribution
to construct a convex pseudo-likelihood
function and maximize it over the parameter
values. The optimized parameter values in effect reveal the
Markov graph structure. These algorithms have a very good
sample complexity of $\Omega(\Delta^3 \log p)$, where $\Delta$ is the maximum
degree of a node and $p$ is the total number of nodes. However,
they require a strong incoherence assumption to
guarantee success. Bento et al. [18] showed that for
a large class of Ising models the incoherence conditions are not
satisfied, and hence the convex optimization based algorithms fail.
They also showed that in Ising models with weak long range
correlation, a simple low complexity thresholding algorithm
can correctly learn the graph.

Recently a greedy learning algorithm was proposed in [13]
which tries to find the minimum value of the conditional
entropy of a particular node in order to estimate its neighborhood.
We call this algorithm Greedy($\epsilon$). It is an extension
of the Chow-Liu algorithm to graphs with cycles. It was shown
that for graphs with correlation decay and large girth this
algorithm exactly recovers the graph G. However, it fails for graphs with
small cycles. A forward-backward greedy algorithm based on
convex optimization, which works for any pairwise graphical model,
was also presented recently by Jalali et al.
It requires milder assumptions than [10] and also gives a
better sample complexity.

This paper is organized as follows. First we review the
definition of a graphical model and the graphical model
learning problem in Section II. The three greedy algorithms are
described in Section III. Next we give sufficient conditions for
the success of the greedy algorithms in Section IV. In Section V
we present the main theorems showing the performance of the
recursive greedy, forward-backward and greedy with pruning
algorithms. We compare the performance of our algorithms
with other well known algorithms in Section VI. In Section VII
we present some simulation results. The proofs are presented
in the appendix.


II.  Brief  Review:  Graphical  Models 

In this section we briefly review the general graphical model
and the Ising model. Let $X = (X_1, X_2, \ldots, X_p)$ be a random
vector over a discrete set $\mathcal{X}^p$, where $\mathcal{X} = \{1, 2, \ldots, m\}$.
Let $X_S = (X_i : i \in S)$ denote the random vector over the subset
$S \subseteq \{1, 2, \ldots, p\}$. Let $G = (V, E)$ denote a graph having $p$
nodes. Let $\Delta$ be the maximum degree of the graph $G$ and $\Delta_i$
be the degree of the $i$th node. An undirected graphical model
or Markov random field is a tuple $M = (G, X)$ such that
each node in $G$ corresponds to a particular random variable
in $X$. Moreover, $G$ captures the Markov dependence between
the variables $X_i$, such that the absence of an edge $(i, j)$ implies
the conditional independence of variables $X_i$ and $X_j$ given all
the other variables.

For any node $r \in V$, let $\mathcal{N}_r$ denote the set of neighbors of $r$
in $G$. Then the distribution $P(X)$ has the special Markov property
that for any node $r$, $X_r$ is conditionally independent of
$X_{V \setminus (\{r\} \cup \mathcal{N}_r)}$ given $X_{\mathcal{N}_r} = \{X_i : i \in \mathcal{N}_r\}$, the neighborhood
of $r$, i.e.

$$P(X_r \mid X_{V \setminus r}) = P(X_r \mid X_{\mathcal{N}_r}) \tag{1}$$

Ising Model: An Ising model is a pairwise graphical model
where the $X_i$ take values in the set $\mathcal{X} = \{-1, 1\}$. In this
paper we also take the node potentials to be zero (the zero
field Ising model). Hence the distribution takes the following
simplified form.

$$P_\Theta(X = x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} \theta_{ij} x_i x_j \Big) \tag{2}$$
| Algorithm | Graph family | Sample complexity | Computation complexity |
|---|---|---|---|
| Bresler et al. | $\Delta$ degree limited | $\Omega(\lvert\mathcal{X}\rvert^{4\Delta} \Delta \log p)$ | $O(p^{2\Delta+1} \log p)$ |
| CVDT | $(\eta, \gamma)$-local separation | $\Omega(\lvert\mathcal{X}\rvert^{2\eta} (\eta + 2) \log p)$ | $O(p^{\eta+2})$ |
| RWL | $\Delta$ degree limited, Ising model, incoherence | $\Omega(\Delta^3 \log p)$ | $O(p^4)$ |
| Jalali et al. | $\Delta$ degree limited, pairwise graphical model, RSC | $\Omega(\Delta^2 \log p)$ | $O(p^4)$ |
| Greedy | $\Delta$ degree limited, large girth, correlation decay | [illegible] | $O(p^2 \Delta)$ |
| RecGreedy | $\Delta$ degree limited, correlation decay | [illegible] | [illegible] |
| FbGreedy | $\Delta$ degree limited, correlation decay | [illegible] | [illegible] |
| GreedyP | $\Delta$ degree limited, correlation decay | [illegible] | [illegible] |
| Bento et al. | $\Delta$ degree limited, Ising model, correlation decay | [illegible] | $O(p^2)$ |

TABLE I: Performance comparison between different discrete graphical model learning algorithms in the literature. $\Delta$ denotes the
maximum degree of the graph and $p$ is the number of nodes. $\mathcal{X}$ is the alphabet set from which a random variable takes its value
in the discrete graphical model. $\zeta$ represents the super-neighborhood size (described in Section IV-C). $\epsilon$ is a non-degeneracy
parameter (see Section IV). $\alpha$ is an input parameter that determines the elimination threshold in the FbGreedy algorithm (see Section
III-B). $\theta$ is the edge weight parameter of the Ising model in [18].


where $x_i, x_j \in \{-1, 1\}$, $\theta_{ij} \in \mathbb{R}$ and $Z$ is the normalizing
constant.
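For very small $p$, the zero field Ising distribution can be tabulated by brute-force enumeration, which makes the Markov property easy to check numerically. The following is only an illustrative sketch: the function name `ising_pmf` and the edge-dictionary representation of the weights are assumptions made here, not part of the paper, and the enumeration is exponential in $p$.

```python
import itertools
import math

def ising_pmf(p, theta):
    """Tabulate a zero field Ising model on p nodes.

    theta is a dict {(i, j): weight} with one entry per edge (hypothetical
    representation). Returns a dict mapping each x in {-1, 1}^p to P(X = x).
    """
    weights = {}
    for x in itertools.product((-1, 1), repeat=p):
        # Unnormalized weight: exp of the sum of theta_ij * x_i * x_j over edges.
        weights[x] = math.exp(sum(t * x[i] * x[j] for (i, j), t in theta.items()))
    Z = sum(weights.values())  # normalizing constant
    return {x: w / Z for x, w in weights.items()}

# Chain 0 - 1 - 2: node 2 is not a neighbor of node 0.
pmf = ising_pmf(3, {(0, 1): 0.5, (1, 2): 0.5})
```

On the chain above, the conditional distribution of $X_0$ given $X_1$ does not change with the value of $X_2$, as the Markov property requires.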

Graphical Model Selection: The graphical model selection
problem is as follows. Given $n$ independent samples $S_n =
\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ from the distribution $P(X)$, where each
$x^{(k)}$ is a $p$-dimensional vector $x^{(k)} = (x_1^{(k)}, \ldots, x_p^{(k)}) \in
\{1, \ldots, m\}^p$, the problem is to estimate the Markov graph
$G$ corresponding to the distribution $P(X)$ by recovering the
correct edge set $E$. This problem is NP-hard in general and has
been solved only under special assumptions on the graphical
model structure. In some cases a learning algorithm is able to
find the correct neighborhood of each node $v \in V$ with high
probability and hence recover the true topology of the graph.
Table I gives a brief survey of the most relevant methods for
discrete graphical models.

We now introduce some notation. For a subset $S \subseteq V$, we
define $P(x_S) = P(X_S = x_S)$, $x_S \in \mathcal{X}^{|S|}$. The empirical
distribution $\hat{P}(X)$ is the distribution of $X$ computed from the
samples. For $i \in V \setminus S$, the entropy of the random variable
$X_i$ conditioned on $X_S$ is written as $H(X_i \mid X_S)$. The empirical
entropy calculated from the empirical distribution
$\hat{P}$ is denoted by $\hat{H}$. If $P$ and $Q$ are two probability measures
over a finite set $\mathcal{Y}$, then the total variation distance between
them is given by $\|P - Q\|_{TV} = \frac{1}{2} \sum_{y \in \mathcal{Y}} |P(y) - Q(y)|$.
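The empirical quantities defined above can be computed directly from counts. The sketch below is an illustration only: the names `cond_entropy` and `tv_distance`, the use of natural logarithms, and the representation of samples as tuples indexed by node are assumptions made here rather than anything specified in the paper.

```python
import math
from collections import Counter

def cond_entropy(samples, i, S):
    """Plug-in estimate of H(X_i | X_S), in nats, from a list of sample tuples."""
    n = len(samples)
    # Counts of (context, value) pairs and of contexts alone.
    joint = Counter((tuple(x[s] for s in S), x[i]) for x in samples)
    ctx = Counter(tuple(x[s] for s in S) for x in samples)
    return -sum(c / n * math.log(c / ctx[k]) for (k, _), c in joint.items())

def tv_distance(P, Q):
    """Total variation distance between two pmfs given as {value: probability} dicts."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(y, 0.0) - Q.get(y, 0.0)) for y in support)
```

With an empty conditioning set `S`, `cond_entropy` reduces to the empirical marginal entropy of $X_i$.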

III.  Greedy  Algorithms 

In this section we describe three new greedy algorithms for
learning the structure of an MRF.

Main idea: The algorithms can be divided into two steps.
The first step, common to all algorithms, is a node pruning
step called super-neighborhood selection (described in detail
in Section IV). This step generates a collection of
super-neighborhoods $S = \{S_i \subseteq V : \mathcal{N}_i \subseteq S_i \text{ and } i \notin S_i \ \forall i \in V\}$,
one for each $i \in V$, such that $|S_i|$ is small. This
collection $S$ is the input to the second step of the
algorithms, which is described next.

A.  Recursive  Greedy  Algorithm 

Idea: Consider first the simpler setting when we have
infinite samples (i.e. access to the true population quantities).
The simple naive greedy algorithm [13] adds nodes to the
neighborhood, stopping when no further strict reduction in
conditional entropy is possible. This stopping happens
when the true neighborhood is a subset of the estimated
neighborhood. Our key observation here is that the last node
added by the naive greedy algorithm will always be in the
true neighborhood, since inclusion of the last neighbor in the
conditioning set is what enables it to reach the minimum conditional
entropy. We leverage this observation by using the naive
greedy algorithm as an inner loop; at the end of every run
of this inner loop, we pick only the last node and add it to
the estimated neighborhood. The next inner loop starts with
this added node as an initial condition, and finds the next one.
Hence the algorithm discovers a neighbor in each run of the
innermost loop and finds all the neighbors of a given node $i$
in exactly $\Delta_i$ iterations of the outer loop.

The above idea works as long as every neighbor has a
measurable effect on the conditional entropy, even when there
are several other variables in the conditioning set. The algorithm
is RecGreedy($\epsilon$), with pseudocode given in Algorithm 1. It needs a
non-degeneracy parameter $\epsilon$, which is the threshold for how much
effect each neighbor has on the conditional entropy.
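The recursive idea can be sketched in a few lines for a single node $i$. This is a hedged illustration, not the authors' implementation: the helper names `cond_entropy`, `greedy_run` and `rec_greedy`, the plug-in entropy estimator, and the $\epsilon/2$ acceptance threshold are assumptions made here, and `candidates` plays the role of the super-neighborhood $S_i$.

```python
import math
from collections import Counter

def cond_entropy(samples, i, S):
    """Plug-in estimate of H(X_i | X_S) in nats."""
    n = len(samples)
    joint = Counter((tuple(x[s] for s in S), x[i]) for x in samples)
    ctx = Counter(tuple(x[s] for s in S) for x in samples)
    return -sum(c / n * math.log(c / ctx[k]) for (k, _), c in joint.items())

def greedy_run(samples, i, candidates, start, eps):
    """One naive greedy pass, starting from the conditioning set `start`.

    Repeatedly adds the candidate that lowers the conditional entropy most,
    while the drop exceeds eps / 2. Returns the nodes added, in order.
    """
    T, added = list(start), []
    while True:
        rest = [k for k in candidates if k not in T]
        if not rest:
            return added
        j = min(rest, key=lambda k: cond_entropy(samples, i, T + [k]))
        if cond_entropy(samples, i, T + [j]) < cond_entropy(samples, i, T) - eps / 2:
            T.append(j)
            added.append(j)
        else:
            return added

def rec_greedy(samples, i, candidates, eps):
    """Keep only the last node of each greedy pass; stop when a pass adds nothing."""
    N = []
    while True:
        added = greedy_run(samples, i, candidates, N, eps)
        if not added:
            return N
        N.append(added[-1])
```

Calling `greedy_run` with an empty `start` gives the naive greedy estimate, so the two can be compared on the same samples.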

B.  Forward-Backward  Greedy  Algorithm 

Our second algorithm takes a different approach to fixing
the problem of spurious nodes added by the naive greedy
algorithm: it adds a backward step at every iteration that
prunes nodes it detects as being spurious. In particular, after
every forward step that adds a node to the estimated
neighborhood, the algorithm finds the node in this new estimated
neighborhood that has the smallest individual effect on the new
conditional entropy. If this effect is too small, the node is removed
from the estimated neighborhood.

The algorithm, FbGreedy($\epsilon$, $\alpha$), is given in Algorithm 2. It
takes two input parameters besides the samples. The first is the
same non-degeneracy parameter $\epsilon$ as in the RecGreedy($\epsilon$)
algorithm. The second parameter $\alpha \in (0, 1)$ is utilized by
Algorithm 1 RecGreedy($\epsilon$)

Generate super-neighborhood $S$
for $i = 1$ to $|V|$ do
    $\hat{N}(i) \leftarrow \emptyset$
    iterate $\leftarrow$ TRUE
    while iterate do
        $\hat{T}(i) \leftarrow \hat{N}(i)$
        last $\leftarrow 0$
        complete $\leftarrow$ FALSE
        while !complete do
            $j = \arg\min_{k \in S_i \setminus \hat{T}(i)} \hat{H}(X_i \mid X_{\hat{T}(i)}, X_k)$
            if $\hat{H}(X_i \mid X_{\hat{T}(i)}, X_j) < \hat{H}(X_i \mid X_{\hat{T}(i)}) - \frac{\epsilon}{2}$ then
                $\hat{T}(i) \leftarrow \hat{T}(i) \cup \{j\}$
                last $\leftarrow j$
            else
                if last $\neq 0$ then
                    $\hat{N}(i) \leftarrow \hat{N}(i) \cup \{\text{last}\}$
                else
                    iterate $\leftarrow$ FALSE
                end if
                complete $\leftarrow$ TRUE
            end if
        end while
    end while
end for


the algorithm to determine the threshold of elimination in the
backward step. We will see later that this parameter also helps
to trade off between the sample and computational complexity
of the FbGreedy($\epsilon$, $\alpha$) algorithm. The algorithm stops when
there are no further forward or backward steps.
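The interleaving of forward and backward steps can be sketched for a single node as follows. This is an illustrative sketch under assumptions made here, not the authors' code: the names `cond_entropy` and `fb_greedy` are hypothetical, the forward threshold is taken as $\epsilon/2$ and the backward threshold as $\alpha\epsilon/2$, and `candidates` stands for the super-neighborhood $S_i$.

```python
import math
from collections import Counter

def cond_entropy(samples, i, S):
    """Plug-in estimate of H(X_i | X_S) in nats."""
    n = len(samples)
    joint = Counter((tuple(x[s] for s in S), x[i]) for x in samples)
    ctx = Counter(tuple(x[s] for s in S) for x in samples)
    return -sum(c / n * math.log(c / ctx[k]) for (k, _), c in joint.items())

def fb_greedy(samples, i, candidates, eps, alpha):
    """Forward-backward greedy neighborhood estimate for node i."""
    N = []
    while True:
        # Forward step: add the best candidate if it lowers the entropy enough.
        added = False
        rest = [k for k in candidates if k not in N]
        if rest:
            j = min(rest, key=lambda k: cond_entropy(samples, i, N + [k]))
            if cond_entropy(samples, i, N + [j]) < cond_entropy(samples, i, N) - eps / 2:
                N.append(j)
                added = True
        # Backward step: drop the node whose removal raises the entropy least,
        # if that increase is below alpha * eps / 2.
        removed = False
        if N:
            l = min(N, key=lambda k: cond_entropy(samples, i, [m for m in N if m != k]))
            without = [m for m in N if m != l]
            if cond_entropy(samples, i, without) - cond_entropy(samples, i, N) < alpha * eps / 2:
                N = without
                removed = True
        if not added and not removed:
            return N
```

Because $\alpha < 1$, a node that has just been added with a drop of more than $\epsilon/2$ cannot be removed in the same iteration, which is what keeps the loop from cycling.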


C.  Greedy  Algorithm  with  Pruning 

The third algorithm overcomes the problem of non-neighbor
inclusion in the Greedy($\epsilon$) algorithm by adding a node pruning
step after the execution of the greedy algorithm (similar to
the backward step in FbGreedy($\epsilon$, $\alpha$)). In this algorithm, after
running the Greedy($\epsilon$) algorithm, the pruning step declares
a neighbor node to be spurious if its removal from the
neighborhood estimate does not significantly increase the final
conditional entropy. These spurious nodes are removed, resulting
in an updated neighborhood estimate for each node. In the
original Greedy($\epsilon$) algorithm this step was impractical, since
the size of the estimated neighborhood could become very
large in a general graph. However, our pre-processing
step (super-neighborhood selection, see Section IV) effectively
precludes this possibility and leads to an efficient and correct
algorithm.

The pseudocode of this greedy algorithm with node pruning,
GreedyP($\epsilon$), is given in Algorithm 3. In addition to
the samples, the input is again a non-degeneracy parameter
$\epsilon$, as in the RecGreedy($\epsilon$) and FbGreedy($\epsilon$, $\alpha$) algorithms.


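Algorithm 2 FbGreedy($\epsilon$, $\alpha$)

Generate super-neighborhood $S$
for $i = 1$ to $|V|$ do
    $\hat{N}(i) \leftarrow \emptyset$
    added $\leftarrow$ FALSE
    complete $\leftarrow$ FALSE
    while !complete do
        ▷ Forward step:
        $j = \arg\min_{k \in S_i \setminus \hat{N}(i)} \hat{H}(X_i \mid X_{\hat{N}(i)}, X_k)$
        if $\hat{H}(X_i \mid X_{\hat{N}(i)}, X_j) < \hat{H}(X_i \mid X_{\hat{N}(i)}) - \frac{\epsilon}{2}$ then
            $\hat{N}(i) \leftarrow \hat{N}(i) \cup \{j\}$
            added $\leftarrow$ TRUE
        else
            added $\leftarrow$ FALSE
        end if
        ▷ Backward step:
        $l = \arg\min_{k \in \hat{N}(i)} \hat{H}(X_i \mid X_{\hat{N}(i) \setminus k})$
        if $\hat{H}(X_i \mid X_{\hat{N}(i) \setminus l}) - \hat{H}(X_i \mid X_{\hat{N}(i)}) < \frac{\alpha \epsilon}{2}$ then
            $\hat{N}(i) \leftarrow \hat{N}(i) \setminus \{l\}$
        else
            if !added then
                complete $\leftarrow$ TRUE
            end if
        end if
    end while
end for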


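Algorithm 3 GreedyP($\epsilon$)

Generate super-neighborhood $S$
for $i = 1$ to $|V|$ do
    Run Greedy($\epsilon$) within set $S_i$
    for $j \in \hat{N}(i)$ do
        if $\hat{H}(X_i \mid X_{\hat{N}(i) \setminus j}) - \hat{H}(X_i \mid X_{\hat{N}(i)}) < \frac{\epsilon}{2}$ then
            $\hat{N}(i) \leftarrow \hat{N}(i) \setminus j$
        end if
    end for
end for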
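A compact sketch of greedy selection followed by pruning, for a single node, is given below. It is an illustration under assumptions made here rather than the authors' implementation: the names `cond_entropy`, `greedy` and `greedy_p` are hypothetical, both thresholds are taken as $\epsilon/2$, and `candidates` stands for the super-neighborhood $S_i$.

```python
import math
from collections import Counter

def cond_entropy(samples, i, S):
    """Plug-in estimate of H(X_i | X_S) in nats."""
    n = len(samples)
    joint = Counter((tuple(x[s] for s in S), x[i]) for x in samples)
    ctx = Counter(tuple(x[s] for s in S) for x in samples)
    return -sum(c / n * math.log(c / ctx[k]) for (k, _), c in joint.items())

def greedy(samples, i, candidates, eps):
    """Naive greedy: add the best candidate while the entropy drop exceeds eps / 2."""
    N = []
    while True:
        rest = [k for k in candidates if k not in N]
        if not rest:
            return N
        j = min(rest, key=lambda k: cond_entropy(samples, i, N + [k]))
        if cond_entropy(samples, i, N + [j]) < cond_entropy(samples, i, N) - eps / 2:
            N.append(j)
        else:
            return N

def greedy_p(samples, i, candidates, eps):
    """Run greedy, then prune every node whose removal barely changes the entropy."""
    N = greedy(samples, i, candidates, eps)
    for j in list(N):  # iterate over a snapshot, since N shrinks during pruning
        without = [k for k in N if k != j]
        if cond_entropy(samples, i, without) - cond_entropy(samples, i, N) < eps / 2:
            N = without
    return N
```

The pruning loop conditions on the whole current estimate, which is why keeping `candidates` small matters for the quality of the entropy estimates.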


IV.  Sufficient  Conditions  for  Markov  Graph 
Recovery 

In this section we describe sufficient conditions which
guarantee that the RecGreedy($\epsilon$), FbGreedy($\epsilon$, $\alpha$) and
GreedyP($\epsilon$) algorithms recover the correct Markov graph $G$.

A.  Non-degeneracy 

Our non-degeneracy assumption requires every neighbor
to have a significant enough effect. Other graphical model learning
algorithms require similar assumptions to ensure correctness
[10], [11], [13].

(A1) Non-degeneracy condition: Consider the graphical
model $M = (G, X)$, where $G = (V, E)$. Then for all $i \in V$
and $A \subset V$ such that $\mathcal{N}_i \not\subset A$ the following condition holds.
Let $j \in \mathcal{N}_i$ and $j \notin A$. Then there exists $\epsilon > 0$ such that

$$H(X_i \mid X_A, X_j) < H(X_i \mid X_A) - \epsilon \tag{3}$$

Thus by adding a neighboring node to any conditioning set
that does not already contain it, the conditional entropy
decreases by more than $\epsilon$. Also, the above condition together
with the local Markov property (1) implies that the conditional
entropy attains a unique minimum at $H(X_i \mid X_{\mathcal{N}_i})$.


B.  Correlation  Decay 

Correlation decay broadly means that the influence of a random
variable on the distribution of another gradually decreases
as the path distance between the corresponding nodes increases
in the graph $G$. Bento et al. [18] showed that learning
graphical models becomes more difficult in the absence of some
form of correlation decay. Many different forms of correlation
decay have been assumed in MRF learning algorithms.
We assume a weak form of correlation decay similar
to the weak spatial mixing assumption in [20]. First we define
the following quantity.

Definition 1 Consider the graphical model $M = (G, X)$.
Let $i, j \in V$. Define $\phi_i(j) = \max_{x, x'} \|P(X_i \mid X_j = x) -
P(X_i \mid X_j = x')\|_{TV}$. The corresponding function calculated
from the empirical distribution $\hat{P}$ is denoted by $\hat{\phi}_i(j)$.

$\phi_i(j)$ denotes the maximum variation distance between the
conditional distributions of $X_i$ conditioned on two different
values of $X_j$. The correlation decay assumption is now the
following.

(A2) Correlation decay: For the graphical model $M =
(G, X)$ there exists a monotonic decreasing function $f : \mathbb{Z} \to
\mathbb{R}$ such that for any $i, j \in V$

$$\phi_i(j) \le f(d(i, j)) \tag{4}$$

where $d(i, j)$ is the graph distance between nodes $i$ and $j$.
It can be shown that the correlation decay assumption in [13]
implies (A2) for a monotonic decreasing function $f(\cdot)$ (scaled
appropriately). Hence this is a weaker assumption. Next we
give an example in which the decay function $f(\cdot)$ is exponential.

Example 1 (Exponential correlation decay)

It can be shown that in any graphical model $M = (G, X)$, if
Dobrushin's condition holds then $M$ exhibits exponential
correlation decay. First we restate the definition of the influence
coefficient.

Definition 2 (Influence coefficient) For any $i, j \in V$ the
influence coefficient of node $j$ on node $i$ is

$$C_{ij} = \max_{\substack{y, z \in \mathcal{X}^{|V| - 1} \\ y_k = z_k \ \forall k \neq j}} \|P(X_i \mid X_{V \setminus i} = y) - P(X_i \mid X_{V \setminus i} = z)\|_{TV}$$

Note that due to the Markov property of the graph, $C_{ij} = 0$
for all $j \notin \mathcal{N}_i$. Dobrushin's condition is the
following.

Dobrushin's condition: Let $C_{ij}$ be the influence coefficient
of node $j$ on node $i$. Then Dobrushin's condition requires

$$\gamma = \sup_{i \in V} \sum_{j \in V} C_{ij} < 1 \tag{5}$$

In an Ising model (2) with maximum degree $\Delta$ and $\theta_{ij} = \theta$,
Dobrushin's condition corresponds to $\gamma = \Delta \tanh 2\theta < 1$
[21]. The following lemma connects this to assumption (A2).

Lemma 1 ([21]) Suppose Dobrushin's condition holds for a
Markov random field. Then,

$$\phi_i(j) \le \frac{\gamma^{d(i,j)}}{1 - \gamma}$$

where $\gamma$ is given by (5).

Hence in this case $f(x) = \frac{\gamma^x}{1 - \gamma}$ is an exponentially decaying
function.


C. Super-Neighborhood Selection

In this section we describe a method to choose a
super-neighborhood $S_i$ for each node $i \in V$ in the first step
of the RecGreedy($\epsilon$), FbGreedy($\epsilon$, $\alpha$) and GreedyP($\epsilon$)
algorithms when there is correlation decay. Recall that a
super-neighborhood set $S_i$ for any node $i$ is a subset of nodes which
contains the true neighborhood $\mathcal{N}_i$.

Before we describe the procedure, we motivate the need
for a super-neighborhood selection method. First, observe
that if we run Algorithms 1, 2 and 3 with $S_i = V$ for
all $i \in V$ and with the exact distribution $P(X)$ known, then under
the non-degeneracy assumption (A1) the algorithms correctly
output the true neighborhood $\mathcal{N}_i$ for a suitable $\epsilon$. However,
the problem is that for an arbitrary graphical model (or any
graphical model with a very large super-neighborhood set),
the size of the conditioning set $\hat{T}(i)$ in RecGreedy($\epsilon$)
or $\hat{N}(i)$ in FbGreedy($\epsilon$, $\alpha$) and GreedyP($\epsilon$) can also become
very large. This implies that the number of samples required
to get a good estimate of the conditional entropy $\hat{H}(X_i \mid X_A)$,
which is $\Omega(|\mathcal{X}|^{|A| + 1})$, will be exponentially large (a good estimate
is needed to ensure that Algorithms 1, 2 and 3 give the correct
graph $G$ with high probability). In order to mitigate this
problem we need to appropriately bound the size of the sets
$\hat{T}(i)$ and $\hat{N}(i)$. To do this we choose a super-neighborhood $S_i$
such that $\sup_{i \in V} |S_i| := \zeta$ and $\zeta = O(\log p)$. Then the size
of the conditioning set never exceeds $\zeta$ and the number of
samples needed by the algorithm (sample complexity) will be
at most polynomial in $p$.

The problem of super-neighborhood selection becomes easier
under the correlation decay assumption (A2). Let $\beta$ be such
that

$$\min_{i \in V,\, j \in \mathcal{N}_i} \phi_i(j) = \beta \tag{6}$$

The super-neighborhood is then selected as follows.

$$S_i = \{ j \in V \mid \hat{\phi}_i(j) > \beta/2 \} \tag{7}$$

Remark: Note that there may be other ways to generate
a super-neighborhood based on domain knowledge or structural
properties of the system (e.g. in social networks, weather
forecasting). All that our algorithms need is
a super-neighborhood of small size.
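The thresholding rule for super-neighborhood selection can be sketched directly from samples: estimate $\hat{\phi}_i(j)$ for every other node $j$ and keep those above $\beta/2$. The code below is an illustration only. The names `phi_hat` and `super_neighborhood` are hypothetical, $\beta$ is assumed to be supplied by the user, and node $i$ itself is excluded so that $i \notin S_i$.

```python
from collections import Counter

def phi_hat(samples, i, j):
    """Empirical phi_i(j): largest TV distance between the conditional
    distributions of X_i given two different observed values of X_j."""
    by_val = {}
    for x in samples:
        by_val.setdefault(x[j], Counter())[x[i]] += 1
    conds = []
    for cnt in by_val.values():
        n = sum(cnt.values())
        conds.append({a: c / n for a, c in cnt.items()})
    best = 0.0
    for P in conds:
        for Q in conds:
            tv = 0.5 * sum(abs(P.get(a, 0.0) - Q.get(a, 0.0)) for a in set(P) | set(Q))
            best = max(best, tv)
    return best

def super_neighborhood(samples, i, p, beta):
    """Nodes j != i whose empirical influence on node i exceeds beta / 2."""
    return [j for j in range(p) if j != i and phi_hat(samples, i, j) > beta / 2]
```

Only pairwise statistics are needed here, which is why this step is cheap compared with estimating conditional entropies over large conditioning sets.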

Definition 3 (Super-neighborhood radius) The super-neighborhood
radius $R$ is defined as

$$R = \min\{x \in \mathbb{Z} \mid f(x) < \beta/2\} \tag{8}$$

We assume that $R$ exists and that $R$ does not grow with $p$, i.e.,
$R = O(1)$.
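For a concrete decay function the radius can be found by a direct search. The sketch below assumes an exponential decay of the form $f(x) = \gamma^x / (1 - \gamma)$ with $\gamma < 1$ and searches over positive integers only, since graph distances between distinct nodes are at least 1. The names `decay` and `radius` and the search cap `x_max` are illustrative assumptions.

```python
def decay(gamma):
    """Assumed exponential decay f(x) = gamma**x / (1 - gamma), for 0 < gamma < 1."""
    return lambda x: gamma ** x / (1.0 - gamma)

def radius(f, beta, x_max=10_000):
    """Smallest integer x >= 1 with f(x) < beta / 2."""
    for x in range(1, x_max + 1):
        if f(x) < beta / 2:
            return x
    raise ValueError("f never drops below beta / 2 within x_max")
```

For this particular $f$, the search agrees with solving $\gamma^x / (1 - \gamma) < \beta/2$ for $x$, which gives $x > \log((1 - \gamma)\beta/2) / \log \gamma$.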

Now with correlation decay (A2) this super-neighborhood
selection procedure (7) is successful with high probability
with only $\Omega(\log p)$ samples, where by success we mean that $S_i$
contains the true neighborhood and $\zeta = \max_{i \in V} |S_i|$
is small. This is shown by the following lemmas. We define
the minimum marginal probability $P_{\min}$ as $P_{\min} =
\min_{i \in V,\, x_i \in \mathcal{X}} P(X_i = x_i)$.

Lemma 2 Consider a graphical model $M = (G, X)$ with
distribution $P(X)$, $X \in \mathcal{X}^p$. Let $0 < \delta_1 < 1$. Then if the
number of i.i.d. samples

2  log  \X\p  +  log 

0iJ 

we  have  with  probability  at  least  1  —  8\ 

\P{Xi\Xj)  -  P{Xi\Xj)\  <  (9) 

for all $i, j \in V$, where $\beta$ is given by (6).

Lemma 3 Let a graphical model $M = (G, X)$ satisfy
assumption (A2). Let $0 < \delta_1 < 1$. Then with probability greater
than $1 - \delta_1$, $\mathcal{N}_i \subseteq S_i$ for all $i \in V$ when the number of i.i.d.
samples $n = \Omega(\log \frac{p}{\delta_1})$.

Lemma 4 Consider a graphical model $M = (G, X)$ with
maximum degree $\Delta$ satisfying assumption (A2) with decay
function $f(\cdot)$. Let $0 < \delta_2 < 1$. Then with probability greater
than $1 - \delta_2$ we have $|S_i| \le \Delta_i \Delta^{R-1}$ when the number of
samples $n = \Omega(\log \frac{p}{\delta_2})$, where $R$ is given by (8).

Lemmas 2, 3 and 4 follow from Azuma's concentration
inequality, as presented in the Appendix. We now characterize
the maximum super-neighborhood size $\xi$ in bounded-degree
graphs with exponential correlation decay and in Ising models.

Theorem 1 Consider a graphical model $M = (G, X)$ with
maximum degree $\Delta$ satisfying Dobrushin's condition. Let
$0 < \delta_4 < 1$. Then with probability greater than $1 - \delta_4$ we
have $\xi \le \Delta^{\log((1-\gamma)\beta/2) / \log \gamma}$ when the number of samples
$n = \Omega(\log \frac{p}{\delta_4})$, where $\beta$ is given by equation (6).

Theorem 2 Consider a zero-field Ising model with equal edge
weights $\theta$. Let the maximum degree $\Delta$ satisfy
$\Delta \tanh 2\theta < 1$. Let $0 < \delta_5 < 1$. When the number of
samples $n = \Omega(\log \frac{p}{\delta_5})$, with probability greater than
$1 - \delta_5$ we have $\xi \le \Delta^{\log((1 - \Delta \tanh 2\theta) \tanh\theta / 2) / \log(\Delta \tanh 2\theta)}$.


Theorems 1 and 2 mainly follow from Lemmas 1 and 4.
Detailed proofs are given in the Appendix. A super-neighborhood
selection process for graphs with exponential correlation
decay is available in [11]. However, the super-neighborhood
selection in our setting applies to graphs exhibiting a weaker
form of correlation decay. Further, unlike in [11], where the
main purpose of super-neighborhood selection was to reduce
the computational complexity of the search-based algorithm,
in our case super-neighborhood selection reduces both the
sample and computational complexity of the RecGreedy(ε),
FbGreedy(ε, α) and GreedyP(ε) algorithms.

V.  Main  Result 

In this section we state our main result showing the
performance of the RecGreedy(ε), FbGreedy(ε, α) and
GreedyP(ε) algorithms. First we state some useful lemmas.
We  restate  the  first  lemma  from  m,  ©  that  will  be  used 
to  show  the  concentration  of  the  empirical  entropy  H  with 
samples. 

Lemma 5 Let $P$ and $Q$ be two discrete distributions over a
finite set $\mathcal{X}$ such that $\|P - Q\|_{TV} \le \frac{1}{4}$. Then,

$$|H(P) - H(Q)| \le 2 \|P - Q\|_{TV} \log \frac{|\mathcal{X}|}{2 \|P - Q\|_{TV}}$$

The following two lemmas bound the number of steps in the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms,
which also guarantees their convergence.

Lemma 6 The number of greedy steps in each recursion of
the RecGreedy(ε) algorithm and in the GreedyP(ε) algorithm is
less than $\frac{2 \log |\mathcal{X}|}{\epsilon}$.

Proof: In each step the conditional entropy is reduced
by an amount at least $\epsilon/2$. Since the maximum reduction in
entropy possible is $H(X_i) \le \log |\mathcal{X}|$, the number of steps is
upper bounded by $\frac{2 \log |\mathcal{X}|}{\epsilon}$.

■ 

Note that the GreedyP(ε) algorithm will take at least $\Delta$
steps to include all the neighbors in the conditioning set. Hence
$\frac{2 \log |\mathcal{X}|}{\epsilon} \ge \Delta$.

Lemma 7 The number of steps in FbGreedy(ε, α) is
upper bounded by $\frac{4 \log |\mathcal{X}|}{(1-\alpha)\epsilon}$.

Proof: Note that as long as the forward step is active
(which occurs till all neighbors are included in the
conditioning set $\hat{N}(i)$), in each step the conditional entropy
reduces by at least $(1 - \alpha)\epsilon/2$. Hence all the neighbors are
included within $\frac{2 \log |\mathcal{X}|}{(1-\alpha)\epsilon}$ steps. The number of
non-neighbors included in the conditioning set is also bounded by
$\frac{2 \log |\mathcal{X}|}{(1-\alpha)\epsilon}$. Thus it will take at most the same number of
backward steps to remove the non-neighbors. Hence the total
number of steps is at most $\frac{4 \log |\mathcal{X}|}{(1-\alpha)\epsilon}$.

■ 
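To get a feel for the bounds in Lemmas 6 and 7, the short sketch below evaluates them numerically. It assumes natural logarithms (entropies measured in nats); the function names are illustrative only.

```python
import math


def greedy_step_bound(alphabet_size, eps):
    """Lemma 6 bound: 2 log|X| / eps (natural log assumed)."""
    return 2 * math.log(alphabet_size) / eps


def fb_greedy_step_bound(alphabet_size, eps, alpha):
    """Lemma 7 bound: 4 log|X| / ((1 - alpha) eps) (natural log assumed)."""
    return 4 * math.log(alphabet_size) / ((1 - alpha) * eps)


# Binary alphabet (Ising model), eps = 0.1, alpha = 0.9 as in the experiments.
print(greedy_step_bound(2, 0.1))
print(fb_greedy_step_bound(2, 0.1, 0.9))
```

The FbGreedy bound grows as $\alpha$ approaches 1, since each forward-backward round is then only guaranteed a small net entropy reduction.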

We now state our main theorem showing the performance
of Algorithms 1, 2 and 3.


n  > 


32|  A’p 
B2P2  . 

f  x  777,7,1 
Theorem 3 Consider an MRF over a graph $G$ with maximum
degree $\Delta$, having a distribution $P(X)$.

1) Correctness (non-random): Suppose (A1) holds and the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms
have access to the true conditional entropies therein; then they
correctly estimate the graph $G$.

2) Sample complexity: Suppose (A1) holds,
super-neighborhoods $S_i$ are given and the super-neighborhood
size $|S_i| \le \xi$ for all $i \in V$. Let $0 < \delta < 1$.


When  the  number  of  samples  n  =  fl 


2  log  \X\/  e. 


log? 


the RecGreedy(ε) algorithm correctly estimates $G$ with probability
greater than $1 - \delta$.

When  the  number  of  samples  n  = 

(I  V|41og  \  X  \  /  (tl  —  <x)  e)  \ 

ts(i-a)a. 4 - log|)  the  FbGreedy(e,  a) 

correctly estimates $G$ with probability greater than $1 - \delta$,
for $0 < \alpha < 1$.

- - 1 - J5 -  lOg  |  ) 

the GreedyP(ε) algorithm correctly estimates $G$ with probability
greater than $1 - \delta$.


The proof of correctness with the true conditional entropies
known is straightforward under the non-degeneracy assumption
(A1). The proof in the presence of samples is based on Lemma
8, similar to Lemma 2 in [13], showing the concentration
of the empirical conditional entropy, which is critical for the
success of the RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε)
algorithms. We show that when super-neighborhoods $S_i$ are
given and $|S_i| \le \xi$ for all $i \in V$, with sufficiently many
samples the empirical distributions, and hence the empirical
conditional entropies, also concentrate around their true values
with high probability. This ensures that Algorithms 1, 2
and 3 correctly recover the Markov graph $G$. The complete
proof is presented in the Appendix.


Lemma 8 Consider a graphical model $M = (G, X)$ with
distribution $P(X)$. Let $0 < \delta_3 < 1$. If the number of samples


n  > 


8|A’|2(s+2) 

c3 


(s  +  l)log2p|A’|  +logT*- 
03 


then with probability at least $1 - \delta_3$

$$|\hat{H}(X_i \mid X_S) - H(X_i \mid X_S)| < \zeta$$

for any $S \subset V$ such that $|S| \le s$.
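The quantity that Lemma 8 controls, the empirical conditional entropy $\hat{H}(X_i \mid X_S)$, can be computed from samples as a difference of two plug-in joint entropies. The following is a minimal sketch assuming samples are tuples of discrete symbols and entropies are in nats; the function names are illustrative.

```python
import math
from collections import Counter


def empirical_entropy(samples, nodes):
    """Plug-in entropy (in nats) of the empirical joint distribution of `nodes`."""
    n = len(samples)
    counts = Counter(tuple(s[v] for v in nodes) for s in samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def empirical_cond_entropy(samples, i, cond):
    """H_hat(X_i | X_cond) = H_hat(X_i, X_cond) - H_hat(X_cond)."""
    cond = sorted(cond)
    return empirical_entropy(samples, [i] + cond) - empirical_entropy(samples, cond)


# Toy data: X0 always equals X1, and X2 is independent of both.
samples = [(-1, -1, -1), (-1, -1, 1), (1, 1, -1), (1, 1, 1)]
print(empirical_cond_entropy(samples, 0, {1}))  # X1 determines X0
print(empirical_cond_entropy(samples, 0, {2}))  # X2 carries no information about X0
```

Each call makes one pass over the $n$ samples, which is the $O(n)$ cost per entropy evaluation used in the run-time analysis.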

Lemma 8 follows from Lemma 5 and Azuma's inequality.
Although the sample complexities of the RecGreedy(ε),
FbGreedy(ε, α) and GreedyP(ε) algorithms are slightly
higher than those of other non-greedy algorithms [10], [11], [14], the
main appeal of these greedy algorithms lies in their low
computational complexity. The following theorem characterizes
the computational complexity of Algorithms 1, 2 and 3. When
calculating the run-time, each arithmetic operation and
comparison is counted as a unit-time operation. For example, to
execute line 10 in Algorithm 1, each comparison takes unit
time and each entropy calculation takes $O(n)$ time (since the
empirical conditional entropy is calculated from $n$ samples).
Since there are at most $|S_i| \le \xi$ comparisons,
the total time required to execute this line is $O(n\xi)$.
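The cost accounting above corresponds to a single greedy step: one $O(n)$ entropy evaluation per candidate in the super-neighborhood, followed by a comparison. A hedged sketch of such a step follows. The acceptance rule (accept only if the entropy drops by at least $\epsilon/2$) is taken from the proof of Lemma 6, while the function names and data layout are assumptions of this example rather than the paper's pseudocode.

```python
import math
from collections import Counter


def cond_entropy(samples, i, cond):
    """Empirical H(X_i | X_cond) in nats, computed in one pass per joint entropy."""
    def joint_entropy(nodes):
        n = len(samples)
        counts = Counter(tuple(s[v] for v in nodes) for s in samples)
        return -sum(c / n * math.log(c / n) for c in counts.values())
    cond = sorted(cond)
    return joint_entropy([i] + cond) - joint_entropy(cond)


def greedy_forward_step(samples, i, candidates, current, eps):
    """Return the candidate that most reduces H(X_i | X_current), or None.

    A node is accepted only if the reduction is at least eps/2 (assumed rule).
    """
    base = cond_entropy(samples, i, current)
    best_node, best_h = None, base
    for k in candidates:  # at most |S_i| entropy evaluations
        if k == i or k in current:
            continue
        h = cond_entropy(samples, i, current | {k})
        if h < best_h:
            best_node, best_h = k, h
    if best_node is not None and base - best_h >= eps / 2:
        return best_node
    return None


samples = [(-1, -1, -1), (-1, -1, 1), (1, 1, -1), (1, 1, 1)]
print(greedy_forward_step(samples, 0, {1, 2}, set(), eps=0.2))
```

Repeating this step until it returns `None` gives the basic greedy loop that the three algorithms build on.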

Theorem 4 (Run-time) Consider a graphical model $M = (G, X)$,
with maximum degree $\Delta$, satisfying assumption (A1)
and $|S_i| \le \xi$ for all $i \in V$. Then the second step has an
expected run-time of,

•  0(5pt;3n  +  (1  —  5)f  A£n)  for  the  RecGreedy(e)  algo¬ 
rithm. 

•  for  the  FbGreedy(e,a )  algorithm. 

•  0(Spf(f,  +  1  )n  +  (1  —  <5)f  (£+  l)n)  for  the  Greedy P(e) 
algorithm. 

The proof of Theorem 4 is given in the Appendix.

Remark:  Suppose  that  f .  Then  if  we  take  a  <  1 

the  FbGreedy(e ,  a)  has  a  better  run  time  guarantee  than  the 
RecGreedy(e)  algorithm  for  small  S.  But  when  A£  < 
then  the  RecGreedy(e)  algorithm  has  a  better  guarantee. 
Also  when  a  <  -q-j-,  FbGreedy(e,a)  has  a  better  runtime 
guarantee than the GreedyP(ε) algorithm. Note that the
super-neighborhood selection step in Equation (7) has an additional
complexity of $O(p^2)$.

VI.  Performance  Comparison 

In  this  section  we  compare  the  performance  of  the 
RecGreedy(e),  FbGreedy(e,a)  and  GreedyP(e)  algorithms 
with  other  graphical  model  learning  algorithms. 

A.  Comparison  with  Greedy(e)  algorithm: 

The RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε)
algorithms are strictly better than the Greedy(ε) algorithm in
[13]. This is because Algorithms 1, 2 and 3 always find the
correct graph $G$ when Greedy(ε) finds the correct graph, but
they are applicable to a wider class of graphical models since
they do not require the assumption of large girth to guarantee
their success. Further, the correlation decay assumption (A2) in
this paper is weaker than the assumption in [13]. Note that the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms
use the Greedy(ε) algorithm as an intermediate step. Hence
when Greedy(ε) finds the true neighborhood $\mathcal{N}_i$ of node $i$,
the RecGreedy(ε) algorithm finds the correct neighborhood in
each of the recursive steps, and the FbGreedy(ε, α) and GreedyP(ε)
algorithms output the correct neighborhood directly without
having to utilize any of the backward steps or the pruning
step, respectively. Hence the RecGreedy(ε), FbGreedy(ε, α) and
GreedyP(ε) algorithms also succeed in finding the true graph
$G$. We now demonstrate a clear example of a graph where
Greedy(ε) fails to recover the true graph but Algorithms
1, 2 and 3 are successful. This example is also presented in
[13]. Consider an Ising model on the graph in Figure 2. We
have the following proposition.

Proposition 1 Consider an Ising model with $V = \{0, 1, \ldots, D, D+1\}$
and $E = \{(0, i), (i, D+1) \;\forall i : 1 \le i \le D\}$
with a distribution function $P(x) \propto \prod_{(i,j) \in E} e^{\theta x_i x_j}$,
$x_i \in \{1, -1\}$. Then with $D > \frac{2\theta}{\log\cosh(2\theta)} + 1$ we have

$$H(X_0 \mid X_{D+1}) < H(X_0 \mid X_1)$$

Fig. 2: An example of a diamond network with $D + 2$
nodes and maximum degree $D$ where Greedy(ε) fails but the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms
correctly recover the true graph.

The proof follows from a straightforward calculation (see
Appendix). Hence for the Ising model considered above
(Figure 2) with $D > \frac{2\theta}{\log\cosh(2\theta)} + 1$, Greedy(ε) incorrectly
includes node $D + 1$ in the neighborhood set in the first step.
However, with an appropriate $\epsilon$ the MRF satisfies assumption
(A1). Hence by taking $S_i = V$, Theorem 3 ensures that the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms
correctly estimate the graph $G$.
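Proposition 1 can be checked numerically on a small instance by enumerating all configurations of the diamond network. The sketch below is an illustration only: it assumes natural logarithms and the unnormalized weights $\prod_{(i,j) \in E} e^{\theta x_i x_j}$, and the function name is hypothetical.

```python
import math
from itertools import product


def diamond_cond_entropies(D, theta):
    """Return (H(X_0 | X_{D+1}), H(X_0 | X_1)) in nats by exact enumeration."""
    p = D + 2
    edges = [(0, i) for i in range(1, D + 1)] + [(i, D + 1) for i in range(1, D + 1)]
    joint_far = {}   # unnormalized joint weights of (x_0, x_{D+1})
    joint_near = {}  # unnormalized joint weights of (x_0, x_1)
    for x in product((-1, 1), repeat=p):
        w = math.exp(theta * sum(x[a] * x[b] for a, b in edges))
        joint_far[(x[0], x[D + 1])] = joint_far.get((x[0], x[D + 1]), 0.0) + w
        joint_near[(x[0], x[1])] = joint_near.get((x[0], x[1]), 0.0) + w

    def cond_entropy(joint):
        # H(A | B) = -sum_{a,b} P(a, b) log P(a | b)
        z = sum(joint.values())
        marg = {}
        for (_, b), w in joint.items():
            marg[b] = marg.get(b, 0.0) + w
        return -sum((w / z) * math.log(w / marg[b]) for (_, b), w in joint.items())

    return cond_entropy(joint_far), cond_entropy(joint_near)


# theta = 0.5 gives a threshold of about 3.3, so D = 4 lies above it and D = 3 below.
for D in (3, 4):
    far, near = diamond_cond_entropies(D, 0.5)
    print(D, far < near)
```

With $\theta = 0.5$ the non-neighbor $D + 1$ is more informative about node 0 than a true neighbor when $D = 4$, but not when $D = 3$, which matches the threshold in the proposition.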


B.  Comparison  with  search  based  algorithms: 

Search based graphical model learning algorithms like the
Local Independence Test (LIT) by Bresler et al. [11] and
the Conditional Variation Distance Thresholding (CVDT) by
Anandkumar et al. [14] generally have good sample
complexity, but high computation complexity. As we will see, the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms
have slightly higher sample complexity but significantly lower
computational complexity than the search based algorithms.
Moreover, to run the search based algorithms one needs to
know the maximum degree $\Delta$ for LIT and the maximum
size of the separator $\eta$ for the CVDT algorithm. However, the
greedy algorithms can be run without knowing the maximum
degree of the graph.

For  bounded  degree  graphs  the  LIT  algorithm  has  a  sample 
complexity  of  fi(|  A|4AA  log  -#).  Without  any  assumption  on 
the  maximum  size  of  the  separator,  for  bounded  degree  graphs 
the  CVDT  algorithm  also  has  a  similar  sample  complexity  of 
fi(|A|2A(A  +  2)  log|).  Note  that  the  quantity  in  the 
sample  complexity  expression  for  CVDT  algorithm  (Theorem 
2  in  [  1 4 1 )  is  the  minimum  probability  of  P(X$  =  x$)  where 
\S\  <  y+1.  This  scales  with  A  as  Pmin  <  \x^+i  ■  For  general 
degree  bounded  graphs  we  have  y  =  A.  The  sample  com¬ 
plexity  for  RecGreedyie),  GreedyP{e)  and  FbGreedy{e ,  a) 
algorithms  is  slightly  higher  at  Cl  ^ 
A).  However  the  computation  complexity  of  the  LIT  al¬ 
gorithm  is  0(p2A+1\ogp)  and  that  of  the  CVDT  al¬ 
gorithm  is  0(| X\ApA+2n),  which  is  much  larger  that 
0(|A£n)  for  RecGreedy(e)  algorithm,  0(  ^1f’a)e£,'n)  for  the 
FbGreedy(e,  a)  algorithm  and  0(| (£  +  l)n)  for  GreedyP{e) 
algorithm  (since  £  =  O(logp)  and  £  <  AB  when  (A2)  holds). 
Recall  however  that  using  the  correlation  decay  property  and 
super-neighborhood  selection,  the  computation  complexity  of 
search based algorithms can be decreased. In [11], Bresler
et  al.  showed  that  by  assuming  exponential  correlation  decay 
(with  parameter  u)  and  a  super-neighborhood  selection,  the 

A  lpg(4//j ) 

LIT  algorithm  has  a  run-time  of  0(pA  >•  n).  We  have 

the  following  proposition  for  the  CVDT  algorithm. 


Proposition 2 Consider a graphical model $M = (G, X)$,
where $G = (V, E)$ has maximum degree $\Delta$, satisfying
correlation decay (A2). Then by super-neighborhood
selection the CVDT algorithm has an expected run-time of
$O(p \Delta^{(\Delta+1)R} |\mathcal{X}|^{\Delta} n)$, when the super-neighborhood is chosen
as in (7).

However  with  correlation  decay  (A2)  the  run-time  of  Al¬ 
gorithms  |l|  |2l  and  [3|  are  OfBAR+1n),  0( j^A Rn)  and 
Ol'j ARri)  respectively  still  smaller  than  the  LIT  and  CVDT 
algorithms. 


C.  Comparison  with  convex  optimization  based  algorithms: 

In [10], Ravikumar et al. presented a convex optimization
based learning algorithm for Ising models, which we
refer to as the RWL algorithm. It was later extended
to any pairwise graphical model by Jalali et al.
These algorithms assume that extra incoherence or restricted strong
convexity conditions hold, in which case they have a low
sample complexity of $\Omega(\Delta^3 \log p)$. However, these algorithms
have a computation complexity of $O(p^4)$, higher than the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms.
Moreover, the greedy algorithms we propose are applicable to
a wider class of graphical models. Finally, these optimization
based algorithms require a strong incoherence property to
guarantee their success, a condition which may not hold even for
a large class of Ising models, as shown by Bento et al. in [18].
They also prove that the RWL algorithm fails in a diamond
network (Figure 2) for a large enough degree. Whenever there
is a strong correlation between non-neighbors, our algorithms
successfully recover the correct graph in such scenarios. In
our simulations later we will see that the failure of the RWL
algorithm for the diamond network exactly corresponds to the
case $\Delta > D_{th} = \frac{2\theta}{\log\cosh(2\theta)} + 1$, which is also when the

Greedy(ε) fails. In [18] the authors prove that for a given
$\Delta$ the RWL algorithm fails when $\theta < \theta_T$ and this critical
threshold $\theta_T$ behaves like $\frac{1}{\Delta}$. Now if we define


$$\theta_0 = \max\left\{ \theta : \frac{2\theta}{\log\cosh(2\theta)} + 1 > \Delta \right\} \qquad (10)$$


Then  from  our  simulations  for  all  9  <  90  the  RWL 
algorithm  fails.  Also  this  9 0  is  almost  equal  to  ^ .  Hence  we 
make  the  following  conjecture. 
Conjecture 1 The RWL algorithm fails to recover the correct
graph in the diamond network exactly when $\theta < \theta_0$, with $\theta_0$ given
by equation (10).
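The threshold $\theta_0$ in equation (10) has no closed form, but it can be located numerically because $\frac{2\theta}{\log\cosh(2\theta)}$ is strictly decreasing in $\theta > 0$. The following bisection sketch is an illustration; the function names and the search interval are assumptions of this example.

```python
import math


def degree_threshold(theta):
    """D_th = 2 theta / log cosh(2 theta) + 1."""
    return 2 * theta / math.log(math.cosh(2 * theta)) + 1


def theta_zero(max_degree, lo=1e-6, hi=10.0, iters=200):
    """Bisection for the theta at which degree_threshold(theta) equals max_degree."""
    if not degree_threshold(hi) < max_degree < degree_threshold(lo):
        raise ValueError("max_degree outside the range covered by [lo, hi]")
    for _ in range(iters):
        mid = (lo + hi) / 2
        if degree_threshold(mid) > max_degree:
            lo = mid  # threshold still above the degree: move right
        else:
            hi = mid
    return lo


print(round(degree_threshold(0.25), 2))  # value used in the simulations for theta = .25
print(round(degree_threshold(0.5), 2))   # value used in the simulations for theta = .5
print(theta_zero(4))                     # theta_0 for maximum degree 4
```

The first two printed values can be compared against the thresholds quoted in the simulation section for $\theta = .25$ and $\theta = .5$.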

In [19], Jalali et al. presented a forward-backward algorithm
based on convex optimization for learning pairwise graphical
models (as opposed to general graphical models in this paper).
It has an even lower sample complexity of $\Omega(\Delta^2 \log p)$ and works
under slightly milder assumptions than the RWL algorithm.


D.  Which  greedy  algorithm  should  we  use ? 

From the above performance comparison we can say that the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms
can be used to find the graph $G$ efficiently when the
super-neighborhood size is small ($\xi = O(\log p)$) in discrete
graphical models with correlation decay. A natural question to ask
then is which among these three greedy algorithms should
we use? The answer depends on the particular application.
In terms of sample complexity, theoretically FbGreedy(ε, α)
has a higher sample complexity than the RecGreedy(ε) and
GreedyP(ε) algorithms. However, the difference is not large
for a constant $\alpha$, and in our experiments we see that all
three greedy algorithms have similar sample complexities
(see Section VII). The theoretical guarantee on computation
complexity also varies depending on the parameters $\alpha$, $\Delta$ and $\xi$.
However, from our experiments we see that RecGreedy(ε) has a
much higher run-time than the FbGreedy(ε, α) and GreedyP(ε)
algorithms, which show similar run-times. We conclude that
in practical applications it is better to use the FbGreedy(ε, α)
and GreedyP(ε) algorithms than the RecGreedy(ε) algorithm.

Now suppose we know a bound on the maximum degree $\Delta$.
If, after running the Greedy(ε) algorithm, the size of the
estimated neighborhood set $\hat{N}(i)$ is considerably higher than
$\Delta$, this indicates that there is a large number of non-neighbors.
In such cases GreedyP(ε) may take a considerable time to
remove these non-neighbors during the node pruning step, whereas the
FbGreedy(ε, α) algorithm could have removed many of these
nodes in earlier iterations, when the size of the conditioning
set was still small, resulting in less computation. This calls for
the use of the FbGreedy(ε, α) algorithm in these cases. Similarly,
when the size of the set $\hat{N}(i)$ returned by Greedy(ε) is
comparable to or slightly greater than $\Delta$, it will be more efficient
to use the GreedyP(ε) algorithm (for example in the diamond
network and grid network as shown in Section VII).


VII.  Simulation  Results 

In this section we present some simulation results
characterizing the performance of the RecGreedy(ε), FbGreedy(ε, α) and
GreedyP(ε) algorithms. We compare the performance with
the Greedy(ε) algorithm [13] and the logistic regression based
RWL algorithm [10] in Ising models. We consider two graphs,
a 4x4 square grid (Figure 3) and the diamond network (Figure
2). In each case we consider an Ising model on the graph.
For the 4x4 grid we take the edge weights $\theta \in \{.25, -.25\}$,
generated randomly. For the diamond network we take all
equal edge weights $\theta = .25$ or $.5$. Independent and identically
distributed samples are generated from the models using Gibbs
sampling and the algorithms are run with an increasing number
of samples. We implement the RWL algorithm using the
$\ell_1$-logistic regression solver by Koh et al. [24] and our algorithms
using MATLAB. The parameter $\epsilon$ for the greedy algorithms
and the $\ell_1$ regularization parameter $\lambda$ for the RWL algorithm
are chosen through cross validation, taking the value which gives the
least estimation error on a training dataset. From Theorems 3 and
4 we see that increasing $\alpha$ reduces the sample complexity of the
FbGreedy(ε, α) algorithm but increases its run-time, and vice
versa. Hence for practical applications $\alpha$ can be chosen to
trade off between sample complexity and run-time to best suit
the application requirements. In our experiments $\alpha$ was taken
as .9.
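For readers who want to reproduce the data generation step, a basic Gibbs sampler for an Ising model is sketched below. This is not the authors' MATLAB code: the function name, the burn-in and thinning values, and the edge-dictionary input format are all assumptions of this example, and thinned Gibbs samples are only approximately independent.

```python
import math
import random


def gibbs_ising(edges, p, n_samples, burn_in=1000, thin=10, seed=0):
    """Draw approximate samples from P(x) proportional to exp(sum theta_uv x_u x_v).

    `edges` maps node pairs (u, v) to edge weights theta_uv; spins are in {-1, +1}.
    """
    rng = random.Random(seed)
    nbrs = {v: [] for v in range(p)}
    for (a, b), theta in edges.items():
        nbrs[a].append((b, theta))
        nbrs[b].append((a, theta))
    x = [rng.choice((-1, 1)) for _ in range(p)]
    samples, sweeps = [], 0
    while len(samples) < n_samples:
        for v in range(p):
            field = sum(theta * x[u] for u, theta in nbrs[v])
            # Conditional law of one spin: P(X_v = +1 | rest) = 1 / (1 + exp(-2 field)).
            prob_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
            x[v] = 1 if rng.random() < prob_plus else -1
        sweeps += 1
        # Keep every `thin`-th sweep after the burn-in period.
        if sweeps > burn_in and (sweeps - burn_in) % thin == 0:
            samples.append(tuple(x))
    return samples


# Diamond network with D = 4 and theta = 0.5 (nodes 0..5, node 5 plays the role of D + 1).
D, theta = 4, 0.5
edges = {(0, i): theta for i in range(1, D + 1)}
edges.update({(i, D + 1): theta for i in range(1, D + 1)})
data = gibbs_ising(edges, p=D + 2, n_samples=50)
print(len(data), data[0])
```

The resulting list of tuples has the same layout as the sample arrays assumed by the entropy-based routines, so it can be fed directly to a greedy neighborhood estimator.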


Fig. 3: A 4x4 grid with $\Delta = 4$ and $p = 16$ used for
the simulation of the RecGreedy(ε), FbGreedy(ε, α) and
GreedyP(ε) algorithms.


First we show that for the diamond network (Figure 2),
whenever $D > D_{th} = \frac{2\theta}{\log\cosh(2\theta)} + 1$ the RWL algorithm fails
to recover the correct graph. We run the RWL algorithm in the
diamond network with increasing maximum degree $D$, keeping
$\theta$ fixed. We take $\theta = .25$, for which
$D_{th} = \frac{2\theta}{\log\cosh(2\theta)} + 1 = 5.16$.
The performance is shown in Figure 4. We clearly see
that the failure of the RWL algorithm in the diamond network
corresponds exactly to the case when $D > D_{th}$. The RWL
algorithm fails since it predicts a false edge between nodes 0
and $D + 1$. This is surprising since this is also the condition
in Proposition 1, which describes the case when the Greedy(ε)
algorithm fails for the diamond network for the same reason
of estimating a false edge. In some sense $D = D_{th}$ marks the
transition between weak and strong correlation between
non-neighbors in the diamond network, and both the Greedy(ε) and
RWL algorithms fail whenever there is a strong correlation.
However, we see next that our greedy Algorithms 1, 2 and 3
succeed even when $D > D_{th}$.

Figure 1 shows the performance of the various algorithms
in the case of the diamond network with $p = 6$, $\theta = .5$
and $D = 4 > D_{th} = 3.3$. The Greedy(ε) and RWL algorithms
are unable to recover the graph, but the RecGreedy(ε),
FbGreedy(ε, α) and GreedyP(ε) algorithms recover the true graph
$G$; they also show the same error performance. However,
Figure 5 shows that GreedyP(ε) has a better runtime than
the RecGreedy(ε) and FbGreedy(ε, α) algorithms for this
diamond network.

Figure 6 shows the performance of the different algorithms
for a 4 x 4 grid network. We see that for this network
the RWL algorithm shows a better sample complexity than
RecGreedy(ε), FbGreedy(ε, α) or GreedyP(ε), as predicted
by the performance analysis. This network exhibits a weak
correlation among non-neighbors, hence Greedy(ε) is
able to correctly recover the graph, which implies
that RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε)
also correctly recover the graph, and all have the same
performance.

Fig. 4: Performance of the RWL algorithm in the diamond network
of Figure 2 for varying maximum degree with $\theta = .25$ and
$D_{th} = 5$. RWL fails whenever $D > D_{th}$.

Fig. 5: Average runtime performance of the
RecGreedy(ε), FbGreedy(ε, α) and GreedyP(ε) algorithms
for the diamond network with $p = 6$, $\Delta = 4$, for varying
sample size.

Figure 7 shows that the GreedyP(ε) algorithm also has the
best runtime in the 4x4 grid network among the new greedy
algorithms.
This implies that node $j$ is included in the
super-neighborhood $S_i$. Hence for all $i \in V$, $\mathcal{N}_i \subseteq S_i$ with
probability greater than $1 - \delta_1$.


Lemma 4

Proof: Let $\mu = \beta/2 - f(R + 1)$. From Lemma 2 we know


that  for  n  >  ff) 


2  log  \X\p  +  log  2 

\P{Xi\Xj)  -  P(Xi\Xj)\  < 


we  have 
4|*| 


for all $i, j \in V$.

Then for any $j \in V$ such that $d(i, j) \ge R + 1$ we have,


Appendix

In this section we present the proofs of Lemmas 2, 3, 4, 8,
Theorems 1, 2, 3, 4 and Propositions 1, 2.

Lemma 2

Proof: Using Azuma's inequality we get for any $x_i, x_j$

$$P\left( |\hat{P}(x_i, x_j) - P(x_i, x_j)| > \gamma_1 \right) \le 2 \exp(-2 \gamma_1^2 n) \le \frac{\delta_1}{|\mathcal{X}|^2 p^2}$$


Taking a union bound over all $x_i, x_j \in \mathcal{X}$ and $i, j \in V$ we
get with probability at least $1 - \delta_1$
f(R+l)  +  P  <  | 


Hence with the number of samples $n = \Omega(\log \frac{p}{\delta_2})$, with
probability greater than $1 - \delta_2$ all nodes $j \in V$ such that
$d(i, j) \ge R + 1$ are not included in $S_i$. Hence $|S_i| \le \Delta_i \Delta^{R-1}$
with probability greater than $1 - \delta_2$.


$$|\hat{P}(x_i, x_j) - P(x_i, x_j)| < \gamma_1 \quad \forall x_i, x_j$$


Taking  7 1  =  we  get  with  probability  at  least  1  —  <5i 


\P(Xi\Xj)  -  P{Xi\Xj)\  < 
Lemma 3

Proof: From Lemma 2 it is clear that if the number of
samples $n = \Omega(\log \frac{p}{\delta_1})$ then with probability at least $1 - \delta_1$
equation (9) holds for all $i, j \in V$. Then for any $i \in V$, $j \in \mathcal{N}_i$,
with probability at least $1 - \delta_1$ we have


Theorem 1

Proof: From Lemma 1 we know that for any graphical
model satisfying Dobrushin's condition the correlation decay
function is exponential, $f(x) = \frac{\gamma^x}{1 - \gamma}$. Now from Lemma
4 we know that with $n = \Omega(\log \frac{p}{\delta_4})$ samples the
super-neighborhood size is bounded by $\xi \le \Delta^R$ with high
probability. For exponential correlation decay the super-neighborhood
radius can be bounded by $R \le \log\frac{(1-\gamma)\beta}{2} / \log \gamma$. The theorem
follows.


Theorem 2

Proof: It is easy to show that in a zero-field Ising
model with maximum degree $\Delta$ and edge weights $\theta$,
Dobrushin's condition is satisfied for $\gamma = \Delta \tanh 2\theta < 1$.
Hence such an Ising model shows exponential correlation
decay, and from Theorem 1, with $n = \Omega(\log \frac{p}{\delta_5})$ samples,
with high probability the super-neighborhood size is
bounded by $\xi \le \Delta^{\log((1-\gamma)\beta/2) / \log \gamma}$. Now as shown in [18],
the correlation between neighbors $X_i$ and $X_j$ can be bounded
by $C_{ij} = E[X_i X_j] \ge \tanh\theta$. However, from symmetry of the
partition function as argued in [22] we have
$P(X_i = 1, X_j = -1) = P(X_i = -1, X_j = 1)$ and
$P(X_i = 1, X_j = 1) = P(X_i = -1, X_j = -1)$.
Hence it can be shown that even
$\phi_i(j) \ge \tanh\theta$. Now taking $\gamma = \Delta \tanh 2\theta$ and $\beta = \tanh\theta$
in Theorem 1 gives the result.


Lemma 8

Proof: The proof is similar to that in [13]. Let $S \subset V$
such that $|S| \le s$. For any $i \in V$, using Azuma's inequality
we get,

$$P\left( |\hat{P}(x_i, x_S) - P(x_i, x_S)| > \gamma_3 \right) \le 2 \exp(-2 \gamma_3^2 n)$$

2  S3 

-  mm-*' (say) 

Now taking a union bound over all $i \in V$, $S \subset V$, $|S| \le s$
and all $x_i \in \mathcal{X}$, $x_S \in \mathcal{X}^{|S|}$, with probability at least $1 - \delta_3$ we
have $|\hat{P}(x_i, x_S) - P(x_i, x_S)| < \gamma_3$ for any $i \in V$, $S \subset V$,
$|S| \le s$. This implies


\\P(Xi,Xs)-P(Xi,Xs)\\Tv  < 

\\P(XS)  -  P(Xs)\\tv  <  ^73 


Now  taking  73  =  2b6\x\‘+2  anc*  using  Lemma  5  we  get. 


\HiXi\Xs)  -  H(Xi\Xs)\  <  \H(Xi,Xs)-H(Xi,Xs)\  + 
\H(XS)  -  H(XS)\ 

2\\P(Xi,Xs)  -  PjX^Xs^TV 

\X\ 

,  \x\ 

log  — ^ - b 

2\\P(Xi,Xs)-P(Xi,Xs)\\Tv 

2\\P(Xs)-P(Xs)\\tv  {  \X\ 

1*1  g  2\\P(XS)  -  P(Xs)\\tv 

<  2\x\V\Wxi  <  y 

■ 

Theorem 3

Proof: The proof of correctness when $P(X)$ is known
is straightforward. From the local Markov property, the
conditional entropy $H(X_i \mid X_{\mathcal{N}_i}) = H(X_i \mid X_{V \setminus i}) < H(X_i \mid X_A)$
for any set $A$ not containing all the neighbors $\mathcal{N}_i$. From the
non-degeneracy assumption (A1), including a neighboring node in
the conditioning set always produces a decrease in entropy of at
least $\epsilon$. In RecGreedy(ε), in each iteration the algorithm runs
till all the neighbors $\mathcal{N}_i$ are included in the conditioning set,
and the last added node is always a neighbor. In GreedyP(ε),
nodes are added till all neighbors have been included in the
conditioning set. Then in the pruning step removing a
non-neighbor does not increase the entropy, therefore all spurious
nodes are detected and removed. In FbGreedy(ε, α), each
iteration decreases the entropy by at least $(1 - \alpha)\epsilon/2$. Since the
entropy is bounded, it terminates in a finite number of steps, and the
minimum is reached only when all neighbors have been added
to the conditioning set. All spurious nodes get eliminated by
the backward steps (in earlier iterations or after all neighbors
are added).
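The pruning argument above can be illustrated with a small sketch that takes a conditional-entropy oracle as input. The removal threshold of $\epsilon/2$, the sequential removal order and the function names are assumptions of this example and not the paper's pseudocode for GreedyP(ε).

```python
def prune_neighborhood(cond_entropy, i, estimate, eps):
    """Drop nodes whose removal raises H(X_i | X_set) by less than eps/2 (assumed rule).

    `cond_entropy(i, nodes)` is any callable returning H(X_i | X_nodes).
    """
    kept = set(estimate)
    for j in sorted(estimate):
        increase = cond_entropy(i, kept - {j}) - cond_entropy(i, kept)
        if increase < eps / 2:
            kept.remove(j)  # removing j barely changes the entropy: treat as spurious
    return kept


# Hypothetical oracle: nodes 1 and 2 are true neighbors of node 0, node 3 is not.
def toy_entropy(i, nodes):
    return 1.0 - 0.3 * len(set(nodes) & {1, 2})


print(prune_neighborhood(toy_entropy, 0, {1, 2, 3}, eps=0.3))
```

With true entropies, removing a neighbor costs at least $\epsilon$ by (A1) while removing a non-neighbor costs nothing, so any threshold strictly between 0 and $\epsilon$ separates the two cases.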

Now  we  give  the  proof  of  sample  complexity  when  we 
have  samples.  Define  the  error  event  £  =  { 73'  C  V.  \ S  < 
£|  \H(Xi\Xg)  —  H(Xi\Xg)\  >  |}.  Note  that  when  £c  occurs 
we  have  for  any  i  G  V,  j  £  Afi,  A  C  V\{i,j},  |A|  <  £ 

H{Xi\XA)-H{Xi\XA,Xj)  >  H(Xi\XA ) 
-H(Xi  (11) 

which  follows  from  equation  (|5J. 

Proof for RecGreedy(ε) algorithm: We first show that when
$\mathcal{E}^c$ occurs, RecGreedy(ε) correctly estimates the graph $G$.
The proof is by induction. Let $\mathcal{N}_i = \{j_1, \ldots, j_{\Delta_i}\} \subseteq S_i$, since
super-neighborhoods $S_i$ are given. Let $r$ denote the counter
indicating the number of times the outermost while loop has
run and $s$ be the counter indicating the number of times the
inner while loop has run in a particular iteration of the outer
while loop. Clearly $s \le |S_i|$. In the first step, since $\hat{T}(i) = \emptyset$,
the algorithm finds the node $k \in S_i$ such that $\hat{H}(X_i \mid X_k)$ is
minimized and adds it to $\hat{T}(i)$. Suppose it runs till $s = s_1$ such
that $\mathcal{N}_i \not\subset \hat{T}(i)$; then there exists some $j_l \in \mathcal{N}_i$ such that
$j_l \notin \hat{T}(i)$. Then from equation (11),
$\hat{H}(X_i \mid X_{\hat{T}(i)}, X_{j_l}) < \hat{H}(X_i \mid X_{\hat{T}(i)}) - \epsilon/2$.

Hence $\min_{k \in S_i - \hat{T}(i)} \hat{H}(X_i \mid X_{\hat{T}(i)}, X_k) < \hat{H}(X_i \mid X_{\hat{T}(i)}) - \epsilon/2$.

Therefore the control goes to the next iteration $s = s_1 + 1$.
However, after the last neighbor, say $j_l$, is added to $\hat{T}(i)$ we
have


< 


\H(Xi\Xni),Xk)  -  H(Xi\Xf(i))\ 
\H{Xi\Xni),Xk)-H{Xi\Xni))\ 


4 

(12) 


for any $k \in S_i - \hat{T}(i)$. Thus $j_l$ is added to $\hat{N}(i)$, the variable
complete is set to TRUE and the control exits the inner while
loop, going to the next iteration $r = r + 1$. Proceeding similarly,
one neighboring node is discovered in each iteration $r = 1$ to
$r = \Delta_i$. At $r = \Delta_i + 1$, $\hat{N}(i) = \mathcal{N}_i$. Thus in step $s = 1$,
$\hat{T}(i) = \mathcal{N}_i$, so the entropy cannot be reduced further. Hence the
variable iterate is set to FALSE and control exits the outer
while loop, returning the correct neighborhood $\hat{N}(i) = \mathcal{N}_i$.

Lemma  |6|  bounds  the  number  of  steps  in  each  iteration  by 

2  log  1*1 

Now  taking  S3  =  S,  s  =  21°s  ^ ,  £  =  |  in  Lemma  [il  we 

(1  VI2  log  l-^l/e  6  \  ^ 

- — ! — ^5 - log  |  J ,  P(£)  <  <5. 

Therefore with probability greater than $1 - \delta$ the
RecGreedy(ε) algorithm correctly recovers $G$.


Proof  for  FbGreedy(e,  a)  algorithm:  Define  £  =  {3S1  C 
V,\S\  <  £|  \H{Xi\Xs)  -  H(Xi\Xs)\  >  f }.  Let  s  denote 
the  number  of  iterations  of  the  while  loop.  When  £c  occurs 
we  have  for  any  i  G  V,  j  G  A/i,  A  C  V\{i,j},  |A|  <  £ 
H{Xi\XA)-H{Xi\XA,Xj)  >  H{Xi \XA) 
~H(Xi \XAtXj)-™>^  (13) 

Again we prove by induction. For $s = 1$ the forward step adds a node to the conditioning set $\hat{N}(i)$ as shown previously for the RecGreedy($\epsilon$) algorithm. Consider iteration $s > 1$. Note that it is enough to show the following.

•  In each iteration the backward step never removes a neighboring node $j \in \mathcal{N}_i$.

•  After the last neighbor is added to the conditioning set $\hat{N}(i)$ the backward step removes all non-neighbors, if any.

From equation (13) it is clear that removing a neighboring node $j \in \hat{N}(i) \cap \mathcal{N}_i$ increases the entropy by at least $\frac{\epsilon}{2} > \frac{\alpha\epsilon}{2}$. Hence a neighboring node is never removed in the backward step. If there exists a non-neighbor $l \in \hat{N}(i)$ such that $\hat{H}(X_i|X_{\hat{N}(i) \setminus l}) - \hat{H}(X_i|X_{\hat{N}(i)}) < \frac{\alpha\epsilon}{2}$ and it produces the least increase in entropy, then it gets removed from $\hat{N}(i)$ and we go to iteration $s + 1$. This continues till the forward step has added all neighbors $j \in \mathcal{N}_i$ to the conditioning set. After adding the last neighbor to the conditioning set, equation (12) ensures that the forward step adds no other nodes to the conditioning set $\hat{N}(i)$. If $\hat{N}(i) = \mathcal{N}_i$ we are done. Else for any non-neighbor $j \in \hat{N}(i)$ we have,


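$$|\hat{H}(X_i|X_{\hat{N}(i) \setminus j}) - \hat{H}(X_i|X_{\hat{N}(i)})| \le |H(X_i|X_{\hat{N}(i) \setminus j}) - H(X_i|X_{\hat{N}(i)})| + \frac{\alpha\epsilon}{4} = 0 + \frac{\alpha\epsilon}{4} = \frac{\alpha\epsilon}{4} < \frac{\alpha\epsilon}{2} \tag{14}$$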


Hence the backward step will remove $j$ from the conditioning set (or any other non-neighbor that produces the least increase in entropy). This continues till all non-neighbors are removed and $\hat{N}(i) = \mathcal{N}_i$, at which point neither the forward nor the backward step makes a change. The flag complete is then set to TRUE and Algorithm 2 exits the while loop, giving
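the correct neighborhood of node $i$. Again from Lemma 7 the number of steps required for convergence is bounded by $\frac{4\log|\mathcal{X}|}{(1-\alpha)\epsilon}$. As shown previously for the RecGreedy($\epsilon$) algorithm, from the entropy concentration lemma with $\delta_3 = \delta$, $s = \frac{4\log|\mathcal{X}|}{(1-\alpha)\epsilon}$ and $\zeta = \frac{\alpha\epsilon}{8}$, for

$$n = \Omega\left(\frac{|\mathcal{X}|^{4\log|\mathcal{X}|/((1-\alpha)\epsilon)}}{(1-\alpha)\alpha^4\epsilon^5}\log\frac{p}{\delta}\right)$$

the probability of error $P(\mathcal{E}) \le \delta$. Therefore the FbGreedy($\epsilon$, $\alpha$) succeeds with probability at least $1 - \delta$. This completes the proof.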
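
The forward and backward steps argued about above can be sketched in Python as follows. This is a hedged illustration rather than the paper's Algorithm 2: it assumes a forward threshold of $\epsilon/2$ and a backward threshold of $\alpha\epsilon/2$, uses exact entropies from a probability table instead of empirical ones, and exercises the loop on an assumed toy diamond-shaped Ising model. All function names are hypothetical.

```python
import itertools
import math
from collections import defaultdict


def entropy(dist, idx):
    """Entropy in nats of the marginal of `dist` on the coordinates `idx`."""
    marginal = defaultdict(float)
    for x, prob in dist.items():
        marginal[tuple(x[k] for k in idx)] += prob
    return -sum(q * math.log(q) for q in marginal.values() if q > 0)


def cond_entropy(dist, i, cond):
    """H(X_i | X_cond) = H(X_i, X_cond) - H(X_cond)."""
    cond = sorted(cond)
    return entropy(dist, cond + [i]) - entropy(dist, cond)


def diamond_ising(theta, D):
    """Assumed toy model: nodes 0 and D+1 are joined only through nodes 1..D."""
    weights = {}
    for x in itertools.product((-1, 1), repeat=D + 2):
        energy = sum(x[0] * x[k] + x[k] * x[D + 1] for k in range(1, D + 1))
        weights[x] = math.exp(theta * energy)
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def without(nodes, j):
    """Copy of `nodes` with node j removed."""
    return [k for k in nodes if k != j]


def fb_greedy(dist, i, candidates, eps, alpha):
    """Forward-backward greedy sketch for node i."""
    nbhd = []
    while True:
        changed = False
        # Forward step: add the node giving the largest entropy decrease,
        # provided the decrease is at least eps / 2.
        rest = [k for k in candidates if k not in nbhd]
        if rest:
            best = min(rest, key=lambda k: cond_entropy(dist, i, nbhd + [k]))
            gain = cond_entropy(dist, i, nbhd) - cond_entropy(dist, i, nbhd + [best])
            if gain >= eps / 2:
                nbhd.append(best)
                changed = True
        # Backward step: remove the node whose deletion increases the entropy
        # the least, provided the increase is below alpha * eps / 2.
        if nbhd:
            weakest = min(nbhd, key=lambda j: cond_entropy(dist, i, without(nbhd, j)))
            loss = cond_entropy(dist, i, without(nbhd, weakest)) - cond_entropy(dist, i, nbhd)
            if loss < alpha * eps / 2:
                nbhd = without(nbhd, weakest)
                changed = True
        if not changed:            # neither step made a change: done
            return nbhd
```

With `theta = 1.0`, `D = 3`, `eps = 0.01` and `alpha = 0.5`, the forward step first adds the non-neighbor node 4, and the backward step deletes it only after all three true neighbors are in the set, which mirrors the two bullet points of the proof.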


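Proof for GreedyP($\epsilon$) algorithm: The proof is similar to that for the RecGreedy($\epsilon$) algorithm. Let event $\mathcal{E}$ be as defined in the proof for the RecGreedy($\epsilon$) algorithm. When event $\mathcal{E}^c$ occurs the Greedy($\epsilon$) runs till all neighbors are added to the set $\hat{N}(i)$. Then for non-neighbors $j \in \hat{N}(i)$

$$|\hat{H}(X_i|X_{\hat{N}(i) \setminus j}) - \hat{H}(X_i|X_{\hat{N}(i)})| \le |H(X_i|X_{\hat{N}(i) \setminus j}) - H(X_i|X_{\hat{N}(i)})| + \frac{\epsilon}{4}.$$

Hence $j$ is removed from $\hat{N}(i)$. But for any neighbor $k \in \mathcal{N}_i$

$$|\hat{H}(X_i|X_{\hat{N}(i) \setminus k}) - \hat{H}(X_i|X_{\hat{N}(i)})| \ge |H(X_i|X_{\hat{N}(i) \setminus k}) - H(X_i|X_{\hat{N}(i)})| - \frac{\epsilon}{4} > \frac{\epsilon}{2}.$$

Thus the neighbors are not eliminated. The algorithm terminates after all non-neighbors have been eliminated. The probability of error is upper bounded by $P(\mathcal{E}) \le \delta$ with

$$n = \Omega\left(\frac{|\mathcal{X}|^{2\log|\mathcal{X}|/\epsilon}}{\epsilon^5}\log\frac{p}{\delta}\right).$$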

■ 
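
The two phases used in the GreedyP argument, a plain greedy pass followed by pruning, can be sketched in Python as below. This is only a sketch under assumptions, not the authors' code: the threshold `tau` stands in for $\epsilon/2$ in both phases, entropies are exact rather than empirical, and the toy diamond-shaped Ising model and all function names are hypothetical.

```python
import itertools
import math
from collections import defaultdict


def entropy(dist, idx):
    """Entropy in nats of the marginal of `dist` on the coordinates `idx`."""
    marginal = defaultdict(float)
    for x, prob in dist.items():
        marginal[tuple(x[k] for k in idx)] += prob
    return -sum(q * math.log(q) for q in marginal.values() if q > 0)


def cond_entropy(dist, i, cond):
    """H(X_i | X_cond) = H(X_i, X_cond) - H(X_cond)."""
    cond = sorted(cond)
    return entropy(dist, cond + [i]) - entropy(dist, cond)


def diamond_ising(theta, D):
    """Assumed toy model: nodes 0 and D+1 are joined only through nodes 1..D."""
    weights = {}
    for x in itertools.product((-1, 1), repeat=D + 2):
        energy = sum(x[0] * x[k] + x[k] * x[D + 1] for k in range(1, D + 1))
        weights[x] = math.exp(theta * energy)
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def greedy(dist, i, candidates, tau):
    """Plain greedy pass: keep adding the best node while the decrease is >= tau."""
    nbhd = []
    while True:
        rest = [k for k in candidates if k not in nbhd]
        if not rest:
            return nbhd
        best = min(rest, key=lambda k: cond_entropy(dist, i, nbhd + [k]))
        drop = cond_entropy(dist, i, nbhd) - cond_entropy(dist, i, nbhd + [best])
        if drop < tau:
            return nbhd
        nbhd.append(best)


def prune(dist, i, nbhd, tau):
    """Pruning pass: drop every node whose removal raises the entropy by < tau."""
    nbhd = list(nbhd)
    for j in list(nbhd):
        rest = [k for k in nbhd if k != j]
        if cond_entropy(dist, i, rest) - cond_entropy(dist, i, nbhd) < tau:
            nbhd = rest
    return nbhd
```

For `theta = 1.0`, `D = 3` and `tau = 0.005`, the greedy pass returns all four candidate nodes with the non-neighbor 4 chosen first, and the pruning pass then removes node 4 while retaining nodes 1, 2 and 3.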

Theorem  @| 

Proof: First consider the RecGreedy($\epsilon$) algorithm. With probability $1 - \delta$ Algorithm 1 finds the correct neighborhood of each node $i$. In this case, from Lemma 6, the number of steps in each recursion is $O(\frac{1}{\epsilon})$, the search in each step takes $O(\xi)$ time, the number of recursions is at most $\Delta$ and the entropy calculation takes $O(n)$ time for each node $i$. Hence the overall runtime is $O(\frac{p}{\epsilon}\Delta\xi n)$. When the algorithm makes an error, with probability $\delta$, the number of steps and the number of recursions are bounded by $O(\xi)$. Hence the overall expected runtime is

$$O\left(\delta p \xi^3 n + (1-\delta)\frac{p}{\epsilon}\Delta\xi n\right).$$

For the FbGreedy($\epsilon$, $\alpha$) algorithm, from Lemma 7 we know that the number of steps is $O(\frac{1}{(1-\alpha)\epsilon})$. The search in either the forward or backward step is bounded by $\xi$ and the entropy calculation takes $O(n)$ time. Hence when the algorithm succeeds the run time is $O(\frac{p\xi n}{(1-\alpha)\epsilon})$. Note that even when the algorithm fails, with probability $\delta$, we can prevent it from going into infinite loops by making sure that once the forward step has stopped it is never restarted. Hence the number of steps will still be $O(\frac{1}{(1-\alpha)\epsilon})$ and the overall runtime remains the same. Thus the expected runtime is $O(\frac{p\xi n}{(1-\alpha)\epsilon})$.

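In the GreedyP($\epsilon$) algorithm, when it succeeds with probability $1 - \delta$, for each node $i \in V$ the Greedy($\epsilon$) takes at most $\frac{2\log|\mathcal{X}|}{\epsilon}$ steps. In each step the search set is bounded by $\xi$ and the conditional entropy computation takes $O(n)$ time. After the greedy algorithm terminates, $|\hat{N}(i)| \le \frac{2\log|\mathcal{X}|}{\epsilon}$ since one node has been added in each step. Hence the number of iterations in the pruning step is bounded by $\frac{2\log|\mathcal{X}|}{\epsilon}$ and again each conditional entropy computation takes $O(n)$ time. Hence the total run-time is $O(\frac{2p\log|\mathcal{X}|}{\epsilon}\xi n + \frac{2p\log|\mathcal{X}|}{\epsilon} n) = O(\frac{2p\log|\mathcal{X}|}{\epsilon}(\xi + 1)n)$. When an error occurs the number of greedy steps and pruning iterations is bounded by $\xi$. Therefore the expected run-time is $O(\delta p \xi(\xi + 1)n + (1 - \delta)\frac{2p\log|\mathcal{X}|}{\epsilon}(\xi + 1)n)$.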

■ 

Proposition 1

Proof: Define $H(a) = a\log(\frac{1}{a}) + (1 - a)\log(\frac{1}{1-a})$ for $0 < a < 1$. Then simple calculation shows $H(X_0|X_{D+1}) = H(p)$ and $H(X_0|X_1) = H(q)$ where

$$p = \frac{2^{D+1}}{2^{D+1} + 2(e^{2\theta} + e^{-2\theta})^D}, \qquad q = \frac{2^D + 2e^{-2\theta}(e^{2\theta} + e^{-2\theta})^{D-1}}{2^{D+1} + 2(e^{2\theta} + e^{-2\theta})^D}.$$

Note that $p < \frac{1}{2}$ and $q < \frac{1}{2}$. Since $H(a)$ is monotonic increasing for $0 < a < \frac{1}{2}$, $H(X_0|X_{D+1}) < H(X_0|X_1)$ iff $p < q$. This implies


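$$\begin{aligned}
2^{D+1} &< 2^D + 2e^{-2\theta}(e^{2\theta} + e^{-2\theta})^{D-1} \\
2 &< 1 + e^{-2\theta}\left(\frac{e^{2\theta} + e^{-2\theta}}{2}\right)^{D-1} \\
e^{2\theta} &< \left(\frac{e^{2\theta} + e^{-2\theta}}{2}\right)^{D-1} \\
D &> \frac{2\theta}{\log\left(\frac{e^{2\theta} + e^{-2\theta}}{2}\right)} + 1 = \frac{2\theta}{\log\cosh(2\theta)} + 1
\end{aligned}$$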
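
The threshold on $D$ can be checked numerically. The short Python sketch below evaluates the closed forms for $p$ and $q$ given in this proof and compares the entropy ordering with the condition $D > \frac{2\theta}{\log\cosh(2\theta)} + 1$. The function names are hypothetical and the check is only an illustration over a few parameter values, not a substitute for the derivation.

```python
import math


def binary_entropy(a):
    """H(a) = a log(1/a) + (1 - a) log(1/(1 - a)), in nats."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return a * math.log(1.0 / a) + (1.0 - a) * math.log(1.0 / (1.0 - a))


def p_and_q(theta, D):
    """Closed forms for p and q as stated in the proof."""
    c = math.exp(2 * theta) + math.exp(-2 * theta)
    denom = 2 ** (D + 1) + 2 * c ** D
    p = 2 ** (D + 1) / denom
    q = (2 ** D + 2 * math.exp(-2 * theta) * c ** (D - 1)) / denom
    return p, q


def degree_threshold(theta):
    """Right-hand side of the final inequality: 2*theta / log cosh(2*theta) + 1."""
    return 2 * theta / math.log(math.cosh(2 * theta)) + 1


def nonneighbor_looks_better(theta, D):
    """True when H(X_0 | X_{D+1}) < H(X_0 | X_1), i.e. when H(p) < H(q)."""
    p, q = p_and_q(theta, D)
    return binary_entropy(p) < binary_entropy(q)
```

For example, with `theta = 0.5` the threshold is a little above 3.3, so `nonneighbor_looks_better(0.5, 3)` is false while `nonneighbor_looks_better(0.5, 4)` is true.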


Proposition 2

Proof: For each node $i \in V$ its distribution can be made conditionally independent of all other nodes except the neighbors. In order to find $\hat{N}(i)$ the CVDT algorithm may be run within the super-neighborhood set $S_i$ to reduce its computational complexity. The minimum size of the separator is upper bounded by $\Delta$. Let $|S_i| \le \Delta^R$. Then CVDT searches over all possible conditioning sets of size $\Delta$, which takes $\binom{\Delta^R}{\Delta} = O(\Delta^{\Delta R})$ iterations. For each conditioning set it takes $O(n)$ time to compute the conditional variation distance and $|\mathcal{X}|^\Delta$ iterations to find the maximum conditional variation distance. Therefore the expected runtime is $O(p\Delta^{(\Delta+1)R}|\mathcal{X}|^\Delta n)$.
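
The counting step in this proof can be illustrated with a few lines of Python. The sketch assumes the reading that the candidate set has size at most $\Delta^R$ and that conditioning sets have size $\Delta$. It only compares the exact number of such subsets with the bound $\Delta^{\Delta R}$, and the function names are hypothetical.

```python
import math


def num_conditioning_sets(delta, R):
    """Number of size-delta subsets of a candidate set of size delta**R."""
    return math.comb(delta ** R, delta)


def iteration_bound(delta, R):
    """The looser bound delta**(delta * R) used in the runtime estimate."""
    return delta ** (delta * R)
```

For instance, with `delta = 3` and `R = 2` there are `math.comb(9, 3)` candidate conditioning sets, which is well below the bound `3 ** 6`.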


